{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "from statsmodels.api import tsa\n",
    "from dateutil.parser import parse\n",
    "\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "\n",
    "def parse_quarter(string):\n",
    "    \"\"\"\n",
    "    Converts a string from the format YYYYQN in datetime object at the end of quarter N.\n",
    "    \"\"\"\n",
    "    \n",
    "    # Note: you could also just retrieve the first four elements of the string\n",
    "    # and the last one... Regex is fun but often not necessary\n",
    "    year, qn = re.search(r'^(20[0-9][0-9])(Q[1-4])$', string).group(1, 2)\n",
    "    \n",
    "    # year and qn will be strings, pd.datetime expects integers.\n",
    "    year = int(year)\n",
    "    \n",
    "    date = None\n",
    "    \n",
    "    if qn=='Q1':\n",
    "        date = pd.datetime(year, 3, 31)\n",
    "    elif qn=='Q2':\n",
    "        date = pd.datetime(year, 6, 30)\n",
    "    elif qn=='Q3':\n",
    "        date = pd.datetime(year, 9, 20)\n",
    "    else:\n",
    "        date = pd.datetime(year, 12, 31)\n",
    "        \n",
    "    return date\n",
    "\n",
    "\n",
    "alcohol_consumption = pd.read_csv('data/NZAlcoholConsumption.csv', \n",
    "                                  parse_dates=['DATE'], \n",
    "                                  date_parser=parse_quarter,\n",
    "                                  index_col='DATE')\n",
    "alcohol_consumption.sort_index(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Modelling and Forecasting Time Series\n",
    "## Getting a stationary time series\n",
    "\n",
    "In the previous notebook, you had seen that differencing with a lag of 4 made sense. \n",
    "\n",
    "Let's do that, does the resulting series look stationary?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wine = alcohol_consumption.TotalWine\n",
    "wine_diff = wine.diff(4).dropna()\n",
    "\n",
    "plt.figure(figsize=(8, 6))\n",
    "plt.plot(wine_diff, \"-o\")\n",
    "plt.title(\"Does this look stationary ?\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks stationary enough. Let's call the result `time_series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "time_series = wine_diff"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Autoregression model\n",
    "\n",
    "In an autoregression model, values are modelled as a linear combination of the $p$ past values. An autoregressive model of order $p$, that is usually indicated as $AR(p)$ model, can be written as:\n",
    "\n",
    "$$\n",
    "y_{t} = c + (\\phi_{1}y_{t-1} + \\phi_{2}y_{t-2} + \\dots + \\phi_{p}y_{t-p}) + e_{t},\n",
    "$$\n",
    "\n",
    "where\n",
    "\n",
    "* `c` is the mean of the time-series\n",
    "* `e_t` is the noise\n",
    "\n",
    "The library `statsmodels` contains a `tsa` module which implements the autoregressive model.\n",
    "\n",
    "**Task**:\n",
    "\n",
    "* Define an `AR` model using `tsa.AR` \n",
    "* select the order using the `select_order` method. You will need to specify a maximum `p` to consider and a criterion for deciding which model is \"best\" \n",
    "  * AIC and BIC are common parameters which encourage \"good fit\" while penalising having too many parameters (complex model). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#com+ add your code here\n",
    "ar = tsa.AR(time_series)\n",
    "optlag = ar.select_order(10, ic=\"aic\")\n",
    "print('Optimal p =', optlag)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Having a look at the AR model\n",
    "\n",
    "Let's see if the AR(p) does a good job compared to the original time series. \n",
    "\n",
    "* use the `fit` method specifying the optimal lag found above\n",
    "* use the `predict` method to generate values starting at the optimal lag\n",
    "* plot the predicted results and the data, how does it look?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#com+ your code here \n",
    "ar_result = ar.fit(maxlag=optlag)\n",
    "prediction = ar_result.predict(start=optlag)\n",
    "\n",
    "plt.figure(figsize=(12, 4))\n",
    "plt.plot(time_series, '-o', label='true')\n",
    "plt.plot(prediction, '-o', label='model')\n",
    "plt.legend(fontsize=12);\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As for the linear regression, we may want to look at the \"size\" of the residuals. For example you could display the Mean Absolute Error (MAE) using `mean_absolute_error` and feeding the original time series from `optlag` onwards and compare it with the prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#com+ add your code here to show the MAE\n",
    "print('MAE = {0:.3f}'.format(mean_absolute_error(time_series[optlag:], prediction)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Autoregression with Sklearn\n",
    "\n",
    "Autoregression can also be implemented using `sklearn`. However, it doesn't provide direct support to handle time series, which means that you have to reorganise the data before estimating the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def organize_data(to_forecast, window, horizon=1):\n",
    "    \"\"\"\n",
    "     Input:\n",
    "      to_forecast, univariate time series organised as numpy array\n",
    "      window, number of observations to use in a forecast window\n",
    "      horizon, horizon of the forecast\n",
    "     Output:\n",
    "      X, a matrix where each row contains a forecast window\n",
    "      y, the target values for each row of X\n",
    "    \"\"\"\n",
    "    shape = to_forecast.shape[:-1] + (to_forecast.shape[-1] - window + 1, window)\n",
    "    strides = to_forecast.strides + (to_forecast.strides[-1],)\n",
    "    X = np.lib.stride_tricks.as_strided(to_forecast,\n",
    "                                        shape=shape,\n",
    "                                        strides=strides)\n",
    "    y = np.array([X[i+horizon][-1] for i in range(len(X)-horizon)])\n",
    "    return X[:-horizon], y\n",
    "\n",
    "\n",
    "X, y = organize_data(time_series, optlag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now fit an autoregressive model which simply corresponds to a `LinearRegression` now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#com+ add your code here to fit a linear regression and show the results\n",
    "lr = LinearRegression()\n",
    "lr.fit(X, y)\n",
    "prediction = lr.predict(X)\n",
    "\n",
    "plt.figure(figsize=(12, 4))\n",
    "plt.plot(time_series.values, '-o',)\n",
    "plt.plot(np.arange(optlag, len(time_series)), prediction, '-o', label='prediction')\n",
    "plt.legend(fontsize=12);\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are the same, yay!\n",
    "\n",
    "***TIP***: Now that you know how to implement autoregression with sklearn, you're also able to create custom autoregressive models using other regressors implemented in sklearn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ARMA model\n",
    "\n",
    "AR models a point in the time series as a linear model of the previous values. The mismatch $e_t$ is assumed to be \"noise\".\n",
    "However there could still be information in the series of $e_t$! How about we add the past errors as additionnal features?\n",
    "\n",
    "This leads to the **ARMA** model with an autoregressive part that you will recognise and a part that corresponds to a moving average:\n",
    "\n",
    "$$\n",
    "y_{t} = c + \\underbrace{ \\phi_{1}y_{t-1} + \\phi_{2}y_{t-2} + \\dots + \\phi_{p}y_{t-p} }_{AR(p)} + \\underbrace{ \\theta_{1}e_{t-1} + \\theta_{2}e_{t-2} + \\dots + \\theta_{q}e_{t-q} }_{MA(q)} +e_{t},\n",
    "$$\n",
    "\n",
    "ARMA models are also implemented in `statsmodels` and their implementation is consistent with the one of AR models. \n",
    "\n",
    "* Create an ARMA model with `tsa.ARMA`, specify $p=3$ and $q=3$\n",
    "* Fit and predict\n",
    "* Display\n",
    "\n",
    "Since the result will look almost identical to just using AR, you will want to show the MAE as well. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#com+ add your code here to fit an ARMA model\n",
    "arma = tsa.ARMA(time_series, order=(3, 3))\n",
    "arma_result = arma.fit()\n",
    "prediction = arma_result.predict(start=3)\n",
    "\n",
    "plt.figure(figsize=(12, 4))\n",
    "plt.plot(time_series, '-o', label='true')\n",
    "plt.plot(prediction, '-o', label='model')\n",
    "plt.legend();\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#com+ add your code here to show the MAE\n",
    "print('MAE = {0:.3f}'.format(mean_absolute_error(time_series[3:], prediction)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Forecasting\n",
    "\n",
    "Until now we have tuned the model to *fit* the time series. \n",
    "This makes us able to find a description of the data, but doesn't give forecast.\n",
    "\n",
    "**Note**: forecasting with time series data is **tricky** and usually basic methods do not really provide very good results (especially on realistic data). ARMA models are nice because they are simple but do not expect fantastic performances. (On the other hand, predicting the future is hard! -- who would have thought).\n",
    "\n",
    "* Separate the time series into a training set and a test set formed of the last 8 points. \n",
    "* Fit an AR model on the training data and try to find the optimal lag using the `BIC` criterion\n",
    "* fit the model, predict and show the prediction on the original time series. Did it do a good job? \n",
    "* compute the MAE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#com+ your code here\n",
    "train = time_series[:-8]\n",
    "test = time_series[-8:]\n",
    "\n",
    "ar = tsa.AR(train)\n",
    "optlag = ar.select_order(10, ic='bic', method='mle')\n",
    "arfit = ar.fit(maxlag=optlag)\n",
    "\n",
    "prediction = arfit.predict(end=len(time_series))[-len(test):]\n",
    "\n",
    "plt.figure(figsize=(8, 6))\n",
    "plt.plot(time_series.values, '-o', label=\"original data\")\n",
    "plt.plot(prediction, '--o', label='prediction')\n",
    "\n",
    "plt.legend(fontsize=12)\n",
    "\n",
    "print('Out of sample MAE = {0:.3f}'.format(mean_absolute_error(test, prediction)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Cross-Validation for Time Series Forecasting (demo) </h2>\n",
    "\n",
    "In this procedure, there is a series of test sets, each consisting of a single observation. The corresponding training set consists only of observations that occurred prior to the observation that forms the test set. Thus, no future observations can be used in constructing the forecast. \n",
    "\n",
    "Let's build a diagram to illustrate the procedure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_size = 4\n",
    "\n",
    "plt.figure(figsize=(15, 10))\n",
    "for i, split in enumerate(range(len(time_series)-test_size, len(time_series))):\n",
    "    plt.subplot(test_size, 1, i+1)\n",
    "    plt.plot(time_series.values[:-test_size+i], \n",
    "             '-o', label='train set at fold %d' % (i+1))\n",
    "    plt.plot(range(len(time_series)-test_size-1+i, len(time_series)), \n",
    "             time_series.values[-test_size-1+i:], \n",
    "             '--', color='gray', label='remaining data')\n",
    "    plt.plot(split, time_series.values[split], \n",
    "             'o', label='test point at fold %d' % (i+1))\n",
    "    plt.xlim([0, len(time_series)])\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now implement the cross-validation training and testing the autoregressive model at each step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "start_split = 4*8\n",
    "predictions = []\n",
    "\n",
    "for split in range(start_split, len(time_series)):\n",
    "    train = time_series[:split]\n",
    "    test = time_series[split:]\n",
    "    ar = tsa.AR(train)\n",
    "    ar_result = ar.fit(maxlag=4)\n",
    "    prediction = ar_result.predict(end=len(time_series))[-len(test):]\n",
    "    predictions.append(prediction.values[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(predictions), len(time_series)-start_split\n",
    "plt.plot(time_series.values, '-o', label='true')\n",
    "plt.plot(range(start_split, len(time_series)), predictions, \n",
    "         '-o', label='out of sample prediction')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(mean_absolute_error(time_series.values[start_split:], predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wrap up\n",
    "- Using lags of the target variable as feature one can build an autoregressive model\n",
    "- Adding moving average terms and differencing lead to ARIMA\n",
    "- The order of the ARIMA models can be hard to tune\n",
    "- Time series models can be used to describe the data and make forecast\n",
    "- Cross validation can be used to estimate the error of the forecast"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
